From 5d1ea9ff45b77c863495e16c6558a7b9ebb15208 Mon Sep 17 00:00:00 2001 From: Alex Crichton Date: Thu, 23 Oct 2014 12:51:51 -0700 Subject: [PATCH] Add a cache to the registry source This should help avoid hitting disk wherever possible and will also be leveraged in just a second! --- src/cargo/sources/registry.rs | 229 ++++++++++++++++++++++++++++++---- 1 file changed, 204 insertions(+), 25 deletions(-) diff --git a/src/cargo/sources/registry.rs b/src/cargo/sources/registry.rs index 4152d9dbc..ad21312be 100644 --- a/src/cargo/sources/registry.rs +++ b/src/cargo/sources/registry.rs @@ -1,3 +1,163 @@ +//! A `Source` for registry-based packages. +//! +//! # What's a Registry? +//! +//! Registries are central locations where packages can be uploaded to, +//! discovered, and searched for. The purpose of a registry is to have a +//! location that serves as permanent storage for versions of a crate over time. +//! +//! Compared to git sources, a registry provides many packages as well as many +//! versions simultaneously. Git sources can also have commits deleted through +//! rebasings where registries cannot have their versions deleted. +//! +//! # The Index of a Registry +//! +//! One of the major difficulties with a registry is that hosting so many +//! packages may quickly run into performance problems when dealing with +//! dependency graphs. It's infeasible for cargo to download the entire contents +//! of the registry just to resolve one package's dependencies, for example. As +//! a result, cargo needs some efficient method of querying what packages are +//! available on a registry, what versions are available, and what the +//! dependencies for each version is. +//! +//! One method of doing so would be having the registry expose an HTTP endpoint +//! which can be queried with a list of packages and a response of their +//! dependencies and versions is returned. This is somewhat inefficient however +//! as we may have to hit the endpoint many times and we may have already +//! queried for much of the data locally already (for other packages, for +//! example). This also involves inventing a transport format between the +//! registry and Cargo itself, so this route was not taken. +//! +//! Instead, Cargo communicates with registries through a git repository +//! referred to as the Index. The Index of a registry is essentially an easily +//! query-able version of the registry's database for a list of versions of a +//! package as well as a list of dependencies for each version. +//! +//! Using git to host this index provides a number of benefits: +//! +//! * The entire index can be stored efficiently locally on disk. This means +//! that all queries of a registry can happen locally and don't need to touch +//! the network. +//! +//! * Updates of the index are quite efficient. Using git buys incremental +//! updates, compressed transmission, etc for free. The index must be updated +//! each time we need fresh information from a registry, but this is one +//! update of a git repository that probably hasn't changed a whole lot so +//! it shouldn't be too expensive. +//! +//! Additionally, each modification to the index is just appending a line at +//! the end of a file (the exact format is described later). This means that +//! the commits for an index are quite small and easily applied/compressable. +//! +//! ## The format of the Index +//! +//! The index is a store for the list of versions for all packages known, so its +//! format on disk is optimized slightly to ensure that `ls registry` doesn't +//! produce a list of all packages ever known. The index also wants to ensure +//! that there's not a million files which may actually end up hitting +//! filesystem limits at some point. To this end, a few decisions were made +//! about the format of the registry: +//! +//! 1. Each crate will have one file corresponding to it. Each version for a +//! crate will just be a line in this file. +//! 2. There will be two tiers of directories for crate names, under which +//! crates corresponding to those tiers will be located. +//! +//! As an example, this is an example hierarchy of an index: +//! +//! ```notrust +//! . +//! ├── 3 +//! │   └── u +//! │   └── url +//! ├── bz +//! │   └── ip +//! │   └── bzip2 +//! ├── config.json +//! ├── en +//! │   └── co +//! │   └── encoding +//! └── li +//!    ├── bg +//!    │   └── libgit2 +//!    └── nk +//!    └── link-config +//! ``` +//! +//! The root of the index contains a `config.json` file with a few entries +//! corresponding to the registry (see `RegistryConfig` below). +//! +//! Otherwise, there are three numbered directories (1, 2, 3) for crates with +//! names 1, 2, and 3 characters in length. The 1/2 directories simply have the +//! crate files underneath them, while the 3 directory is sharded by the first +//! letter of the crate name. +//! +//! Otherwise the top-level directory contains many two-letter directory names, +//! each of which has many sub-folders with two letters. At the end of all these +//! are the actual crate files themselves. +//! +//! The purpose of this layou tis to hopefully cut down on `ls` sizes as well as +//! efficient lookup based on the crate name itself. +//! +//! ## Crate files +//! +//! Each file in the index is the history of one crate over time. Each line in +//! the file corresponds to one version of a crate, stored in JSON format (see +//! the `RegistryPackage` structure below). +//! +//! As new versions are published, new lines are appended to this file. The only +//! modifications to this file that should happen over time are yanks of a +//! particular version. +//! +//! # Downloading Packages +//! +//! The purpose of the Index was to provide an efficient method to resolve the +//! dependency graph for a package. So far we only required one network +//! interaction to update the registry's repository (yay!). After resolution has +//! been performed, however we need to download the contents of packages so we +//! can read the full manifest and build the source code. +//! +//! To accomplish this, this source's `download` method will make an HTTP +//! request per-package requested to download tarballs into a local cache. These +//! tarballs will then be unpacked into a destination folder. +//! +//! Note that because versions uploaded to the registry are frozen forever that +//! the HTTP download and unpacking can all be skipped if the version has +//! already been downloaded and unpacked. This caching allows us to only +//! download a package when absolutely necessary. +//! +//! # Filesystem Hierarchy +//! +//! Overall, the `$HOME/.cargo` looks like this when talking about the registry: +//! +//! ```notrust +//! # A folder under which all registry metadata is hosted (similar to +//! # $HOME/.cargo/git) +//! $HOME/.cargo/registry/ +//! +//! # For each registry that cargo knows about (keyed by hostname + hash) +//! # there is a folder which is the checked out version of the index for +//! # the registry in this location. Note that this is done so cargo can +//! # support multiple registries simultaneously +//! index/ +//! registry1-/ +//! registry2-/ +//! ... +//! +//! # This folder is a cache for all downloaded tarballs from a registry. +//! # Once downloaded and verified, a tarball never changes. +//! cache/ +//! registry1-/-.tar.gz +//! ... +//! +//! # Location in which all tarballs are unpacked. Each tarball is known to +//! # be frozen after downloading, so transitively this folder is also +//! # frozen once its unpacked (it's never unpacked again) +//! src/ +//! registry1-/-/... +//! ... +//! ``` + use std::io::{mod, fs, File}; use std::io::fs::PathExtensions; use std::collections::HashMap; @@ -28,11 +188,18 @@ pub struct RegistrySource<'a, 'b:'a> { handle: Option, sources: Vec, hashes: HashMap<(String, String), String>, // (name, vers) => cksum + cache: HashMap>, } #[deriving(Decodable)] pub struct RegistryConfig { + /// Download endpoint for all crates. This will be appended with + /// `///download` and then will be hit with an HTTP GET + /// request to download the tarball for a crate. pub dl: String, + + /// API endpoint for the registry. This is what's actually hit to perform + /// operations like yanks, owner modifications, publish new crates, etc. pub api: String, } @@ -71,6 +238,7 @@ impl<'a, 'b> RegistrySource<'a, 'b> { handle: None, sources: Vec::new(), hashes: HashMap::new(), + cache: HashMap::new(), } } @@ -185,6 +353,39 @@ impl<'a, 'b> RegistrySource<'a, 'b> { Ok(dst) } + /// Parse the on-disk metadata for the package provided + fn summaries(&mut self, name: &str) -> CargoResult<&Vec<(Summary, bool)>> { + if self.cache.contains_key_equiv(&name) { + return Ok(self.cache.find_equiv(&name).unwrap()); + } + // see module comment for why this is structured the way it is + let path = self.checkout_path.clone(); + let path = match name.len() { + 1 => path.join("1").join(name), + 2 => path.join("2").join(name), + 3 => path.join("3").join(name.slice_to(1)).join(name), + _ => path.join(name.slice(0, 2)) + .join(name.slice(2, 4)) + .join(name), + }; + let summaries = match File::open(&path) { + Ok(mut f) => { + let contents = try!(f.read_to_string()); + let ret: CargoResult>; + ret = contents.as_slice().lines().filter(|l| l.trim().len() > 0) + .map(|l| self.parse_registry_package(l)) + .collect(); + try!(ret.chain_error(|| { + internal(format!("Failed to parse registry's information \ + for: {}", name)) + })) + } + Err(..) => Vec::new(), + }; + self.cache.insert(name.to_string(), summaries); + Ok(self.cache.find_equiv(&name).unwrap()) + } + /// Parse a line from the registry's index file into a Summary for a /// package. /// @@ -223,32 +424,10 @@ impl<'a, 'b> RegistrySource<'a, 'b> { impl<'a, 'b> Registry for RegistrySource<'a, 'b> { fn query(&mut self, dep: &Dependency) -> CargoResult> { - let name = dep.get_name(); - let path = self.checkout_path.clone(); - let path = match name.len() { - 1 => path.join("1").join(name), - 2 => path.join("2").join(name), - 3 => path.join("3").join(name.slice_to(1)).join(name), - _ => path.join(name.slice(0, 2)) - .join(name.slice(2, 4)) - .join(name), - }; - let contents = match File::open(&path) { - Ok(mut f) => try!(f.read_to_string()), - Err(..) => return Ok(Vec::new()), - }; - - let ret: CargoResult>; - ret = contents.as_slice().lines().filter(|l| l.trim().len() > 0) - .map(|l| self.parse_registry_package(l)) - .collect(); - let summaries = try!(ret.chain_error(|| { - internal(format!("Failed to parse registry's information for: {}", - dep.get_name())) - })); - let mut summaries = summaries.into_iter().filter(|&(_, yanked)| { + let summaries = try!(self.summaries(dep.get_name())); + let mut summaries = summaries.iter().filter(|&&(_, yanked)| { dep.get_source_id().get_precise().is_some() || !yanked - }).map(|(summary, _)| summary).collect::>(); + }).map(|&(ref s, _)| s.clone()).collect::>(); summaries.query(dep) } } -- 2.30.2